feat(prepro): basic validation of raw read submissions#6773
feat(prepro): basic validation of raw read submissions#6773maverbiest wants to merge 11 commits into
Conversation
Partly resolves #6758 Currently, user-submitted data are not read in by the nextclade preprocessing pipeline, causing them to be dropped form the submission process. Since we're now working to implement file sharing on Loculus, we need to make it so user-submitted files make it through the preprocessing pipeline as well. This PR makes it so user-submitted files are forwarded through the nextclade pipeline without any processing or checking of file contents. Future PRs will add functionality to, for example, check for host sequences in submitted raw reads. ## Implementation - `parse_ndjson` now also parses the file related information sent to preprocessing by the backend - `UnprocessedData` and `UnprocessedAfterNextclade` both get a `files` attribute - test factories in `factory_methods.py` can now be given files to add to test objects, also added a test case in `test_nextclade_preprocessing.py` that carries file information ## Manual testing I created a preview to test whether files now make it through the nextclade pipeline. When I submit sequences to the preview with attached raw reads, this file now appears in the submission review page: <img width="1594" height="581" alt="grafik" src="https://github.com/user-attachments/assets/45a2c3a7-ef9a-4738-aec7-be8ef5717a1b" /> And also on the sequence details page after the sequence is released: <img width="1124" height="595" alt="grafik" src="https://github.com/user-attachments/assets/578e16c0-6f0e-4e9b-b5e2-2c544b826086" /> ## Open questions One thing I ran in to when implementing this is that you need to add file categories two times in the config: one time under `submissionDataTypes` (file categories that users are allowed to submit) and then again under a top-level `files` field (files accepted as outputs of prepro pipelines): ``` defaultOrganismConfig: &defaultOrganismConfig schema: &schema submissionDataTypes: &defaultSubmissionDataTypes consensusSequences: true maxSequencesPerEntry: 1 files: enabled: true categories: - name: raw_reads displayName: Raw reads ... files: - name: annotations displayName: Annotations - name: raw_reads displayName: Raw reads ``` Would it be nicer to always allow file categories listed under `submissionDataTypes.files` to be output but prepro? Or will we ever have cases where users submit one thing, prepro processes it, and then outputs another filetype? Probably safest to keep as-is for now but just wanted to flag since not doing this properly got me into a weird state where submissions stay in 'processing' indefinitely but never error because the backend doesn't accept the preprocessing output (it only logs errors in the backend). ### PR Checklist ~- [ ] All necessary documentation has been adapted.~ - [x] The implemented feature is covered by appropriate, automated tests. - [x] Any manual testing that has been done is documented (i.e. what exactly was tested?) 🚀 Preview: https://pass-files-through.loculus.org
|
Claude finished @maverbiest's task in 2m 46s —— View job Code Review
Overall this is a clean, well-scoped PR. The architecture choices (typed enum instead of plain string, dispatching via One bug to fix before merge:
Other observations (inline comments posted):
|
| rr_errors, rr_warnings = validate_raw_reads_submission(files) | ||
| errors.extend(rr_errors) | ||
| warnings.extend(rr_warnings) | ||
| case _: |
There was a problem hiding this comment.
FileCategory.ANNOTATIONS is currently in the enum (added for internal use by upload_flatfiles), so if the backend ever sends annotations as an input category, this branch would log a warning. Worth noting for future maintainers that ANNOTATIONS is preprocessing-generated output, not a user-submittable input category — so it should never appear in unprocessed.files.
There was a problem hiding this comment.
Yes currently when we add a new input file category, this match statement will need to be updated to handle it. That is as intended (see discussion of alternative in the PR description)
There was a problem hiding this comment.
I'm wondering if this should emit a hard exception, as (if my understanding is correct) a file category could be added to the values.yaml, and so will be accepted in the backend, but then if its missing in this match statement, it would pass through without validation?
There was a problem hiding this comment.
I guess a hard exception would prevent other categories being allowed for other users of loculus which might not be a good thing
There was a problem hiding this comment.
This file caught some stray updates from running the formatter
tombch
left a comment
There was a problem hiding this comment.
Tested it out and this looks good to me!
| return output_metadata, errors, warnings | ||
|
|
||
|
|
||
| def process_submitted_files( |
There was a problem hiding this comment.
Maybe worth moving the file related functions (this one and validate raw reads) to a new file to not grow this 2k line monstrosity even further
| class AnnotationSourceType(StrEnum): | ||
| METADATA = "Metadata" | ||
| NUCLEOTIDE_SEQUENCE = "NucleotideSequence" | ||
| SUBMITTED_FILE = "SubmittedFile" |
There was a problem hiding this comment.
I think this name is a bit confusing as it implies this is a file submitted by the user but annotations are created by the prepro pipeline, I think just "File" is ok
| warnings: list[ProcessingAnnotation] = [] | ||
|
|
||
| if len(files) > 2: # noqa: PLR2004 | ||
| message = f"Raw reads must be submitted as one or two files, got {len(files)}" |
There was a problem hiding this comment.
This is a message for submitters so I would actually state we want to have paired-end or single-end raw reads, and thus accept a max of 2 files.
anna-parker
left a comment
There was a problem hiding this comment.
some small improvements but also looks good overall!
|
Regarding the alternative suggestion, I like it but I would hold off for now - its not required for the beta and we can make the config a bit nicer in a later step, you can create an issue for this though to track it :-) |
Partially resolves #6758
This PR adds general functionality to validate user-submitted files in the nextclade preprocessing pipeline. How files are processed is determined based on the
FileCategoryand a simple switch statement.The PR introduces a function to validate a submission of raw sequencing reads as a first use case:
The validation is very basic for the moment, and only considers the number of submitted files and their files extensions. This could be built out in the future to include more robust checks.
Alternative approaches
Rather than having a switch statement on
FileCategoryin prepro to determine how different files are processed, we could define this in thevalues.yaml, similar to how we do for metadata fields. E.g.:I decided against this for now to get a working version with a minimal diff, but I would be happy to implement this if people feel it's better.
PR Checklist
🚀 Preview: https://simple-fq-validation.loculus.org